Abstract

We evaluate whether features extracted from 
the activation of a deep convolutional network 
trained in a fully supervised fashion on a large, 
fixed set of object recognition tasks can be re- 
purposed to novel generic tasks. Our generic 
tasks may differ significantly from the originally 
trained tasks and there may be insufficient la- 
beled or unlabeled data to conventionally train or 
adapt a deep architecture to the new tasks. We in- 
vestigate and visualize the semantic clustering of 
deep convolutional features with respect to a va- 
riety of such tasks, including scene recognition, 
domain adaptation, and fine-grained recognition 
challenges. We compare the efficacy of relying 
on various network levels to define a fixed fea- 
ture, and report novel results that significantly 
outperform the state-of-the-art on several impor- 
tant vision challenges. We are releasing DeCAF, 
an open-source implementation of these deep 
convolutional activation features, along with all 
associated network parameters, so that vision
researchers can experiment with deep
representations across a range of
visual concept learning paradigms.

1. Introduction 

Discovery of effective representations that capture salient 
semantics for a given task is a key goal of perceptual 
learning. Performance with conventional visual representa- 
tions, based on flat feature representations involving quan- 
tized gradient filters, has been impressive but has likely 
plateaued in recent years. 

It has long been argued that deep or layered compositional
architectures should be able to capture salient aspects
of a given domain through discovery of salient clusters,
parts, mid-level features, and/or hidden units (Hinton
& Salakhutdinov, 2006; Fidler & Leonardis, 2007; Zhu
et al., 2007; Singh et al., 2012; Krizhevsky et al., 2012). 
Such models have been able to perform better than tradi- 
tional hand-engineered representations in many domains, 
especially those where good features have not already been 
engineered (Le et al., 2011). Recent results have shown 
that moderately deep unsupervised models outperform the 
state-of-the-art gradient histogram features in part-based
detection models (Ren & Ramanan, 2013). 

Deep models have recently been applied to large-scale 
visual recognition tasks, trained via back-propagation 
through layers of convolutional filters (LeCun et al., 1989). 
These models perform extremely well in domains with 
large amounts of training data, and had early success in 
digit classification tasks (LeCun et al., 1998). With the 
advent of large scale sources of category-level training 
data, e.g., (Deng et al., 2009), and efficient implementa- 
tion with on-line approximate model averaging (“dropout”) 
(Krizhevsky et al., 2012), they have recently outperformed 
all known methods on a large scale recognition challenge 
(Berg et al., 2012). 

With limited training data, however, fully-supervised 
deep architectures with the representational capacity of 
(Krizhevsky et al., 2012) will generally dramatically overfit 
the training data. In fact, many conventional visual recog- 
nition challenges have tasks with few training examples; 
e.g., when a user is defining a category “on-the-fly” us- 
ing specific examples, or for fine-grained recognition chal- 
lenges (Welinder et al., 2010), attributes (Bourdev et al., 
2011), and/or domain adaptation (Saenko et al., 2010). 

In this paper we investigate semi-supervised multi-task 
learning of deep convolutional representations, where rep- 
resentations are learned on a set of related problems but 
applied to new tasks which have too few training examples
to learn a full deep representation. Our model can either
be considered as a deep architecture for transfer learning
based on a supervised pre-training phase, or simply
as a new visual feature, DeCAF, defined by the convolutional
network weights learned on a set of pre-defined object
recognition tasks. Our work is also related to representation
learning schemes in computer vision which form an
intermediate representation based on learning classifiers on 
related tasks (Li et al., 2010; Torresani et al., 2010; Quat- 
toni et al., 2008). 

Our main result is the empirical validation that a generic 
visual feature based on convolutional network weights
trained on ImageNet outperforms a host of conventional visual
representations on standard benchmark object recognition
tasks, including Caltech-101 (Fei-Fei et al., 2004),
the Office domain adaptation dataset (Saenko et al.,
2010), the Caltech-UCSD Birds fine-grained recognition
dataset (Welinder et al., 2010), and the SUN-397 scene
recognition database (Xiao et al., 2010).

Further, we analyze the semantic salience of deep convo- 
lutional representations, comparing visual features defined 
from such networks to conventional representations. In 
Section 3, we visualize the semantic clustering properties 
of deep convolutional features compared to baseline rep- 
resentations, and find that convolutional features appear to 
cluster semantic topics more readily than conventional fea- 
tures. Finally, while conventional deep learning can be 
computationally expensive, we note that the run-time and 
resource computation of deep-learned convolutional fea- 
tures are not exceptional in comparison to existing features 
such as HOG (Dalal & Triggs, 2005) or KDES (Bo et al.,
2010).

2. Related work 

Deep convolutional networks have a long history in com- 
puter vision, with early examples showing successful re- 
sults on using supervised back-propagation networks to 
perform digit recognition (LeCun et al., 1989). More re- 
cently, these networks, in particular the convolutional net- 
work proposed by Krizhevsky et al. (2012), have achieved 
competition-winning numbers on large benchmark datasets
consisting of more than one million images, such as Ima- 
geNet (Berg et al., 2012). 

Learning from related tasks also has a long history in ma- 
chine learning beginning with Caruana (1997) and Thrun 
(1996). Later works such as Argyriou et al. (2006) devel- 
oped efficient frameworks for optimizing representations 
from related tasks, and Ando & Zhang (2005) explored how 
to transfer parameter manifolds to new tasks. In computer 
vision, forming a representation based on sets of trained 
classifiers on related tasks has recently been shown to be 
effective in a variety of retrieval and classification settings, 
specifically using classifiers based on visual category detectors
(Torresani et al., 2010; Li et al., 2010). A key question
for such learning problems is to find a feature representation
that captures the object category related information
while discarding noise irrelevant to object category information,
such as illumination.

Transfer learning across tasks using deep representations 
has been extensively studied, especially in an unsupervised 
setting (Raina et al., 2007; Mesnil et al., 2012). However, 
reported successes with such models in convolutional net- 
works have been limited to relatively small datasets such 
as CIFAR and MNIST, and efforts on larger datasets have 
had only modest success (Le et al., 2012). We investi- 
gate the “supervised pre-training” approach proven suc- 
cessful in computer vision and multimedia settings using a 
concept-bank paradigm (Kennedy & Hauptmann, 2006; Li 
et al., 2010; Torresani et al., 2010) by learning the features 
on large-scale data in a supervised setting, then transferring 
them to different tasks with different labels. 

To evaluate the generality of a representation formed from 
a deep convolutional feature trained on generic recognition 
tasks, we consider training and testing on datasets known 
to have a degree of dataset bias with respect to ImageNet. 
We evaluate on the SUN-397 scene dataset, as well as
datasets used to evaluate domain adaptation performance 
directly (Chopra et al., 2013; Kulis et al., 2011). This eval- 
uates whether the learned features could undo the domain 
bias by capturing the real semantic information instead of 
overfitting to domain-specific appearances.

3. Deep Convolutional Activation Features 

In our approach, a deep convolutional model is first trained 
in a fully supervised setting using a state-of-the-art method
(Krizhevsky et al., 2012). We then extract various features
from this network, and evaluate the efficacy of these fea- 
tures on generic vision tasks. Even though the forward pass 
computed by the architecture in this section does achieve 
state-of-the-art performance on ILSVRC-2012, two ques- 
tions remain: 

• Do features extracted from the CNN generalize to 
other datasets? 

• How do these features perform versus depth? 

We address these questions both qualitatively and quanti- 
tatively, via visualizations of semantic clusters below, and 
experimental comparison to current baselines in the following
section.

3.1. An Open-source Convolutional Model 

To facilitate the wide-spread analysis of deep convolutional
features, we developed a Python framework that
allows one to easily train networks consisting of various
layer types and to execute pre-trained networks efficiently
without being restricted to a GPU (which in many cases
may hinder the deployment of trained models). Specifically,
we adopted open-source Python packages such as
numpy/scipy for efficient numerical computation, with
parts of the computation-heavy code implemented in C and
linked to Python. In terms of computation speed, our model
is able to process about 40 images per second with an 8-core
commodity machine when the CNN model is executed
in a minibatch mode.

(a) LLC  (b) GIST  (c) DeCAF1  (d) DeCAF6

Figure 1. This figure shows several t-SNE feature visualizations on the ILSVRC-2012 validation set. (a) LLC, (b) GIST, and features
derived from our CNN: (c) DeCAF1, the first pooling layer, and (d) DeCAF6, the second to last hidden layer (best viewed in color).

Our implementation, decaf, will be publicly available 1 . 
In addition, we will release the network parameters used in 
our experiments to allow for out-of-the-box feature extrac- 
tion without the need to re-train the large network 2 . This 
also aligns with the philosophy of supervised transfer: one 
may view the trained model as an analog to the prior knowl- 
edge a human obtains from previous visual experiences, 
which helps in learning new tasks more efficiently. 

As the underlying architecture for our feature we adopt the
deep convolutional neural network architecture proposed
by Krizhevsky et al. (2012), which won the ImageNet
Large Scale Visual Recognition Challenge 2012 (Berg
et al., 2012) with a top-1 validation error rate of 40.7%. 3
We chose this model due to its performance on a difficult
1000-way classification task, hypothesizing that the activations
of the neurons in its late hidden layers might serve
as very strong features for a variety of object recognition
tasks. Its inputs are the mean-centered raw RGB pixel intensity
values of a 224 x 224 image. These values are forward
propagated through 5 convolutional layers (with pooling
and ReLU non-linearities applied along the way) and 3
fully-connected layers to determine its final neuron activities:
a distribution over the task’s 1000 object categories.
Our instance of the model attains an error rate of 42.9% on
the ILSVRC-2012 validation set, 2.2% shy of the 40.7%
achieved by Krizhevsky et al. (2012).

1 http://decaf.berkeleyvision.org/

2 We note that although our CPU implementation allows one
to also train networks, training of large networks such as the
ones for ImageNet may still be time-consuming on CPUs, and we
rely on our own implementation of the network by extending the
cuda-convnet GPU framework provided by Alex Krizhevsky
to train such models.

3 The model entered into the competition actually achieved a
top-1 validation error rate of 36.7% by averaging the predictions
of 7 structurally identical models that were initialized and trained
independently. We trained only a single instance of the model;
hence we refer to the single model error rate of 40.7%.

We refer to Krizhevsky et al. (2012) for a detailed discus- 
sion of the architecture and training protocol, which we 
closely followed with the exception of two small differ- 
ences in the input data. First, we ignore the image’s orig- 
inal aspect ratio and warp it to 256 x 256, rather than re- 
sizing and cropping to preserve the proportions. Secondly, 
we did not perform the data augmentation trick of adding 
random multiples of the principal components of the RGB
pixel values throughout the dataset, proposed as a way of 
capturing invariance to changes in illumination and color 4 . 
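
The input preparation described here (warping to 256 x 256 without preserving aspect ratio, followed by the center 224 x 224 crop that Section 4 uses for feature extraction, and mean-centering) can be sketched in numpy as follows. This is only an illustration under stated assumptions, not the decaf implementation: the function names are hypothetical, nearest-neighbor sampling is chosen purely for brevity, and the `mean` argument stands in for whatever mean the network was trained with.

```python
import numpy as np


def warp_resize(image, size=256):
    """Resize an H x W x C image to size x size, ignoring aspect ratio.

    Nearest-neighbor sampling is an assumption made to keep the sketch short.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return image[rows[:, None], cols[None, :]]


def center_crop(image, crop=224):
    """Return the central crop x crop window of an H x W x C image."""
    h, w = image.shape[:2]
    top = (h - crop) // 2
    left = (w - crop) // 2
    return image[top:top + crop, left:left + crop]


def preprocess(image, mean):
    """Warp to 256 x 256, take the center 224 x 224 crop, subtract the mean.

    `mean` is a hypothetical per-channel (or per-pixel) mean that must
    broadcast against a 224 x 224 x C array.
    """
    warped = warp_resize(image.astype(np.float32), 256)
    return center_crop(warped, 224) - mean
```

For example, `preprocess(img, mean=np.zeros(3, dtype=np.float32))` maps a 300 x 500 RGB image to a 224 x 224 x 3 float array.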

3.2. Feature Generalization and Visualization 

We visualized the model features to gain insight into the 
semantic capacity of DeCAF and other features that have 
been typically employed in computer vision. In particular, 
we compare the features described in Section 3 with GIST 
features (Oliva & Torralba, 2001) and LLC features (Wang 
et al., 2010). 

We visualize features in the following way: we run the t-SNE
algorithm (van der Maaten & Hinton, 2008) to find a
2-dimensional embedding of the high-dimensional feature
space, and plot them as points colored according to their
semantic category in a particular hierarchy. We did this on
the validation set of ILSVRC-2012 to avoid overfitting effects
(as the deep CNN used in this paper was trained only
on the training set), and also use an independent dataset,
SUN-397 (Xiao et al., 2010), to evaluate how dataset bias
affects our results (see e.g. Torralba & Efros (2011) for a
deeper discussion of this topic).

4 According to the authors, this scheme reduced their models’
test set error by over 1%, likely explaining much of our network’s
performance discrepancy.

Figure 2. In this figure we show how our features trained on
ILSVRC-2012 generalized to SUN-397 when considering semantic
groupings of labels (best viewed in color).

Figure 3. (a) The computation time on each layer when running
classification on one single input image. The layers with the most
time consumption are labeled. (b) The distribution of computation
time over different layer types. In the pie chart, fc = fully connected
layers, conv = convolution layers, pool = pooling layers,
and neuron = neuron layers such as ReLU, sigmoid, and dropout.

One would expect features closer to the output (softmax) 
layer to be linearly separable, so it is not very interesting 
(and also visually quite hard) to represent the 1000 classes 
on the t-SNE derived embedding. 

We first visualize the semantic segregation of the model 
by plotting the embedding of labels for higher levels of 
the WordNet hierarchy; for example, a strong feature for 
visual recognition should cluster indoor and outdoor in- 
stances separately, even though there is no explicit mod- 
eling through the supervised training of the CNN. Figure 1 
shows the features extracted on the validation set using the 
first pooling layer, and the second to last fully connected 
layer, showing a clear semantic clustering in the latter but 
not in the former. This is compatible with common deep 
learning knowledge that the first layers learn “low-level” 
features, whereas the later layers learn semantic or “high-level”
features. Furthermore, other features such as GIST
or LLC fail to capture the semantic difference in the image 
(although they show interesting clustering structure). 5 

More interestingly, in Figure 2 we can see the top performing
features (DeCAF6) on the SUN-397 dataset. Even
there, the features show very good clustering of semantic
classes (e.g., indoor vs. outdoor). This suggests DeCAF
is a good feature for general object recognition tasks.
Consider the case where the object class that we are trying
to detect is not in the original object pool of ILSVRC-2012.
The fact that these features cluster several intermediate
nodes of WordNet implies that these features are an
excellent starting point for generalizing to unseen classes.

5 Some of the features were very high dimensional (e.g. LLC
had 16K dimensions), in which case we preprocess them by randomly
projecting them down to 512 dimensions. Random projections
are cheap to apply and tend to preserve distances well,
which is all the t-SNE algorithm cares about.
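
The random projection step used before t-SNE for very high-dimensional features can be illustrated with a short numpy sketch. The text does not say which random projection was used, so the Gaussian projection matrix below, its scaling, and the function name `random_project` are assumptions made for illustration only.

```python
import numpy as np


def random_project(features, out_dim=512, seed=0):
    """Project n x d features down to n x out_dim with a random matrix.

    Assumption: a dense Gaussian matrix with entries ~ N(0, 1/out_dim),
    which preserves squared Euclidean distances in expectation.
    """
    rng = np.random.RandomState(seed)
    d = features.shape[1]
    projection = rng.randn(d, out_dim) / np.sqrt(out_dim)
    return features @ projection
```

Pairwise distances in the projected space stay close to the original ones with high probability, which is the property the embedding step relies on.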


3.3. Time Analysis 

While it is generally believed that convolutional neural net- 
works take a significant amount of time to execute, a de- 
tailed analysis of the computation time over the multiple 
layers involved is still missing in the literature. In this sub- 
section we report a break-down of the computation time 
analyzed using the decaf framework. 

In Figure 3(a) we lay out the computation time spent on 
individual layers with the most time-consuming layers la- 
beled. We observe that the convolution and fully-connected 
layers take most of the time to run, which is understandable 
as they involve large matrix-matrix multiplications 6 . Also, 
the time distribution over different layer types (Figure 3(b)) 
reveals an interesting fact: in large networks such as the 
current ImageNet CNN model, the last few fully-connected 
layers require the most computation time as they involve 
large transform matrices. This is particularly important 
when one considers classification into a larger number of 
categories or with larger hidden-layer sizes, suggesting that 
certain sparse approaches such as Bayesian output coding 
(Hsu et al., 2009) may be necessary to carry out classification
into an even larger number of object categories.
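
The footnoted implementation strategy for convolutional layers (an im2col step followed by a dense matrix multiplication) can be sketched as below. This is a simplified illustration rather than the decaf code: it assumes a single image in channels-first layout, no padding, and cross-correlation without kernel flipping, and the function names are hypothetical.

```python
import numpy as np


def im2col(image, k, stride=1):
    """Unroll every k x k patch of a C x H x W image into one column.

    Returns (cols, out_h, out_w), where cols has shape
    (C * k * k, out_h * out_w).
    """
    c, h, w = image.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    cols = np.empty((c * k * k, out_h * out_w), dtype=image.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[:, top:top + k, left:left + k]
            cols[:, idx] = patch.ravel()
            idx += 1
    return cols, out_h, out_w


def conv_im2col(image, kernels, stride=1):
    """Convolve a C x H x W image with N x C x k x k kernels (no padding)."""
    n, c, k, _ = kernels.shape
    cols, out_h, out_w = im2col(image, k, stride)
    # One dense matrix product computes all kernels at all positions.
    out = kernels.reshape(n, -1) @ cols
    return out.reshape(n, out_h, out_w)
```

The single matrix product is where an optimized BLAS routine would do most of the work, which is consistent with the observation that convolution time is dominated by large matrix-matrix multiplications.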

4. Experiments 

In this section, we present experimental results evaluat- 
ing DeCAF on multiple standard computer vision bench- 
marks, comparing many possible featurization and classi- 
fication approaches. In each of the experiments, we take 
the activations of the n-th hidden layer of the deep convolutional
neural network described in Section 3 as a feature
DeCAFn. DeCAF7 denotes features taken from the final
hidden layer, i.e., just before propagating through the final
fully connected layer to produce the class predictions.
DeCAF6 is the activations of the layer before DeCAF7, and
DeCAF5 the layer before DeCAF6. DeCAF5 is the first
set of activations that has been fully propagated through
the convolutional layers of the network. We chose not to
evaluate features from any earlier in the network, as the
earlier convolutional layers are unlikely to contain a richer
semantic representation than the later features which form
higher-level hypotheses from the low to mid-level local information
in the activations of the convolutional layers. Because
we are investigating the use of the network’s hidden
layer activations as features, all of its weights are frozen
to those learned on the Berg et al. (2012) dataset. 7 All images
are preprocessed using the procedure described for the
ILSVRC images in Section 3, taking features on the center
224 x 224 crop of the 256 x 256 resized image.

6 We implemented the convolutional layers as an im2col
step followed by dense matrix multiplication, which empirically
worked best with small kernel sizes and a large number of kernels.

                       DeCAF5         DeCAF6         DeCAF7
LogReg                 63.29 ± 6.6    84.30 ± 1.6    84.87 ± 0.6
LogReg with Dropout    -              86.08 ± 0.8    85.68 ± 0.6
SVM                    77.12 ± 1.1    84.77 ± 1.2    83.24 ± 1.2
SVM with Dropout       -              86.91 ± 0.7    85.51 ± 0.9
Yang et al. (2009)     84.3
Jarrett et al. (2009)  65.5

Figure 4. Left: average accuracy per class on Caltech-101 with 30 training samples per class across three hidden layers of the network
and two classifiers. Our result from the training protocol/classifier combination with the best validation accuracy, SVM with Layer 6
(+ dropout) features, is shown in bold. Right: average accuracy per class on Caltech-101 at varying training set sizes.

We present results on multiple datasets to evaluate the 
strength of DeCAF for basic object recognition, domain 
adaptation, fine-grained recognition, and scene recogni- 
tion. These tasks each differ somewhat from that for which 
the architecture was trained, together representing much of 
the contemporary visual recognition spectrum. 

4.1. Object recognition 

To analyze the ability of the deep features to transfer to 
basic-level object category recognition, we evaluate them 
on the Caltech-101 dataset (Fei-Fei et al., 2004). In addition
to directly evaluating linear classifier performance on
DeCAF6 and DeCAF7, we also report results using a regularization
technique called “dropout” proposed by Hinton
et al. (2012). At training time, this technique works by ran- 
domly setting half of the activations (here, our features) in a 
given layer to 0. At test time, all activations are multiplied 
by 0.5. Dropout was used successfully by Krizhevsky et al. 
(2012) in layers 6 and 7 of their network; hence we study 
the effect of the technique when applied to the features de- 
rived from these layers. 
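
The dropout procedure described in this paragraph, applied directly to feature vectors, can be sketched as follows. The function names are hypothetical, and sampling an independent mask entry per activation (so that half are zeroed in expectation) is an assumption about how "randomly setting half of the activations" is realized.

```python
import numpy as np


def dropout_train(features, rng, drop_prob=0.5):
    """Zero each activation independently with probability drop_prob."""
    mask = rng.rand(*features.shape) >= drop_prob
    return features * mask


def dropout_test(features, drop_prob=0.5):
    """Scale all activations by the keep probability (0.5 by default)."""
    return features * (1.0 - drop_prob)
```

Scaling by 0.5 at test time keeps the expected value of each feature equal to what the classifier saw during training.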

7 We also experimented with the equivalent feature using randomized
weights and found it to have performance comparable to
traditional hand-designed features.


In each evaluation, the classifier, a logistic regression (Lo- 
gReg) or support vector machine (SVM), is trained on a 
random set of 30 samples per class (including the back- 
ground class), and tested on the rest of the data, with pa- 
rameters cross-validated for each split on a 25 train/5 vali- 
dation subsplit of the training data. The results in Figure 4, 
left, are reported in terms of mean accuracy per category 
averaged over five data splits. 
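
The evaluation protocol above (a fixed number of random training samples per class, the remainder used for testing, and accuracy averaged per category rather than per image) can be sketched as below. The helper names are hypothetical, and the sketch covers only the splitting and the metric, not the classifiers or the parameter cross-validation.

```python
import numpy as np


def split_per_class(labels, n_train, rng):
    """Pick n_train random samples per class for training; rest for testing."""
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)


def mean_per_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so every class counts equally."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))
```

Calling `split_per_class(labels, 30, rng)` and then again with 25 on the training subset would give the 30-per-class split and the 25/5 validation subsplit; repeating with different seeds gives the multiple data splits over which results are averaged.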

Our top-performing method (based on validation accuracy) 
trains a linear SVM on DeCAF6 with dropout, with test set
accuracy of 86.9%. The DeCAF5 features perform substantially
worse than either the DeCAF6 or DeCAF7 features,
and hence we do not evaluate them further in this paper.
The DeCAF7 features generally have accuracy about 1-2%
lower than the DeCAF6 features on this task. The dropout
regularization technique uniformly improved results by 0- 
2% for each classifier/feature combination. When trained 
on DeCAF, the SVM and logistic regression classifiers per- 
form roughly equally well on this task. 

We compare our performance against the current state-of- 
the-art on this benchmark from Yang et al. (2009), a method 
employing a combination of 5 traditional hand-engineered 
image features followed by a multi-kernel based classifier. 
Our top-performing method training a linear SVM on a sin- 
gle feature outperforms this method by 2.6%. Our method 
also outperforms by over 20% the two-layer convolutional 
network of Jarrett et al. (2009), demonstrating the impor- 
tance of the depth of the network used for our feature. 
Note that unlike our method, these approaches from the 
literature do not implicitly leverage an outside large-scale 
image database like ImageNet. The performance edge of 
our method over these approaches demonstrates the importance
of multi-task learning when performing object recognition
with sparse data like that available in the Caltech-101
benchmark.

We also show how the performance of the two DeCAF6 with
dropout methods above varies with the number of training
cases per category, plotted in Figure 4, right, trained
with fixed parameters and evaluated under the same metric
as before. Our one-shot learning results (e.g., 33.0% for
SVM) suggest that with sufficiently strong representations
like DeCAF, useful models of visual categories can often
be learned from just a single positive example.

4.2. Domain adaptation 

We next evaluate DeCAF for use on the task of domain 
adaptation. For our experiments we use the benchmark Of- 
fice dataset (Saenko et al., 2010). The dataset contains three 
domains: Amazon, which consists of product images taken 
from amazon.com; and Webcam and Dslr, which consist
of images taken in an office environment using a webcam
or digital SLR camera, respectively.

In the domain adaptation setting, we are given a training 
(source) domain with labeled training data and a distinct 
test (target) domain with either a small amount of labeled 
data or no labeled data. We will experiment within the su- 
pervised domain adaptation setting, where there is a small 
amount of labeled data available from the target domain. 

Most prior work for this dataset uses SURF (Bay et al., 
2006) interest point features (available for download with 
the dataset). To illustrate the ability of DeCAF to be robust
to resolution changes, we use the t-SNE (van der
Maaten & Hinton, 2008) algorithm to project both SURF
and DeCAF6, computed for Webcam and Dslr, into a 2D
visualizable space (see Figure 5). We visualize an image
on the point in space corresponding to its low dimension 
projected feature vector. We find that DeCAF not only pro- 
vides better within category clustering, but also clusters 
same category instances across domains. This indicates 
qualitatively that DeCAF removed some of the domain bias 
between the Webcam and Dslr domains. 

We validate this conclusion with a quantitative experiment 
on the Office dataset. Table 1 presents multi-class accuracy
averaged across 5 train/test splits for the domain shifts
Amazon→Webcam and Dslr→Webcam. We use the
standard experimental setup first presented in Saenko et al.
(2010). To compare SURF with the DeCAF6 and DeCAF7
deep convolutional features, we report the multi-class accuracy
for each, using an SVM and Logistic Regression both
trained in 3 ways: with only source data (S), only target
data (T), and source and target data (ST). We also report
results for three adaptive methods run with each DeCAF
feature we consider as input. Finally, for completeness we report a
recent and competing deep domain adaptation result from 
Chopra et al. (2013). DeCAF dramatically outperforms the 
baseline SURF feature available with the Office dataset as 
well as the deep adaptive method of Chopra et al. (2013). 
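
The three non-adaptive training regimes (S, T, and ST) differ only in which labeled data the classifier sees. A minimal sketch of assembling the training set for each regime is given below; the function name and argument layout are hypothetical, and the sampling protocol of Saenko et al. (2010) that decides which examples are available is not reproduced here.

```python
import numpy as np


def training_set(mode, source, target_labeled):
    """Assemble (X, y) for the 'S', 'T', or 'ST' training regime.

    source and target_labeled are (features, labels) pairs, with
    features of shape n x d and labels of shape n.
    """
    xs, ys = source
    xt, yt = target_labeled
    if mode == "S":    # source data only
        return xs, ys
    if mode == "T":    # labeled target data only
        return xt, yt
    if mode == "ST":   # source and labeled target data pooled together
        return np.vstack([xs, xt]), np.concatenate([ys, yt])
    raise ValueError("mode must be 'S', 'T', or 'ST'")
```

In every regime the resulting classifier is evaluated on held-out target-domain data, so only the composition of the training set changes.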



(a) SURF features  (b) DeCAF6

Figure 5. Visualization of the webcam (green) and dslr (blue) domains
using the original released SURF features (a) and DeCAF6
(b). The figure is best viewed by zooming in to see the images
in local regions. All images from the scissor class are shown enlarged.
They are well clustered and overlapping in both domains
with our representation, while SURF only clusters a subset and
places the others in disjoint parts of the space, closest to distinctly
different categories such as chairs and mugs.


4.3. Subcategory recognition 

We tested the performance of DeCAF on the task of subcategory
recognition. To this end, we adopted one of its most
popular tasks, the Caltech-UCSD birds dataset (Welinder
et al., 2010), and compared the performance against several
state-of-the-art baselines.

Following common practice in the literature, we adopted
two approaches to perform classification. Our first approach
adopts an ImageNet-like pipeline, in which we followed
the existing protocol by cropping the images to regions
1.5 x the size of the provided bounding boxes, resizing
them to 256 x 256, and then feeding them into the CNN
pipeline to get the features for classification. We computed
DeCAF6 and trained a multi-class logistic regression on top
of the features.

                        Amazon → Webcam                             Dslr → Webcam
                        SURF          DeCAF6        DeCAF7          SURF          DeCAF6        DeCAF7
Logistic Reg. (S)       9.63 ± 1.4    48.58 ± 1.3   53.56 ± 1.5     24.22 ± 1.8   88.77 ± 1.2   87.38 ± 2.2
SVM (S)                 11.05 ± 2.3   52.22 ± 1.7   53.90 ± 2.2     38.80 ± 0.7   91.48 ± 1.5   89.15 ± 1.7
Logistic Reg. (T)       24.33 ± 2.1   72.56 ± 2.1   74.19 ± 2.8     24.33 ± 2.1   72.56 ± 2.1   74.19 ± 2.8
SVM (T)                 51.05 ± 2.0   78.26 ± 2.6   78.72 ± 2.3     51.05 ± 2.0   78.26 ± 2.6   78.72 ± 2.3
Logistic Reg. (ST)      19.89 ± 1.7   75.30 ± 2.0   76.32 ± 2.0     36.55 ± 2.2   92.88 ± 0.6   91.91 ± 2.0
SVM (ST)                23.19 ± 3.5   80.66 ± 2.3   79.12 ± 2.1     46.32 ± 1.1   94.79 ± 1.2   92.96 ± 2.0
Daume III (2007)        40.26 ± 1.1   82.14 ± 1.9   81.65 ± 2.4     55.07 ± 3.0   91.25 ± 1.1   89.52 ± 2.2
Hoffman et al. (2013)   37.66 ± 2.2   80.06 ± 2.7   80.37 ± 2.0     53.65 ± 3.3   93.25 ± 1.5   91.45 ± 1.5
Gong et al. (2012)      39.80 ± 2.3   75.21 ± 1.2   77.55 ± 1.9     39.12 ± 1.3   88.40 ± 1.0   88.66 ± 1.9
Chopra et al. (2013)    58.85 (Amazon → Webcam)                     78.21 (Dslr → Webcam)

Table 1. DeCAF dramatically outperforms the baseline SURF feature available with the Office dataset as well as the deep adaptive
method of Chopra et al. (2013). We report average multi-class accuracy using both non-adaptive and adaptive classifiers, changing only
the input feature from SURF to DeCAF. Most surprisingly, in the case of Dslr→Webcam the domain shift is largely non-existent with
DeCAF.

In our second approach, we tested DeCAF in a pose-normalized
setting using the deformable part descriptors
(DPD) method (Zhang et al., 2013). Inspired by the deformable
parts model (Felzenszwalb et al., 2010), DPD explicitly
utilizes the part localization to do semantic pooling.
Specifically, after training a weakly-supervised DPM
on bird images, the pooling weight for each part of each component
is calculated by using the key-point annotations to
get cross-component semantic part correspondence. The final
pose-normalized representation is computed by pooling
the image features of predicted part boxes using the pooling
weights. Based on the DPD implementation provided
by the authors, we applied DeCAF with the same pre-trained
DPM model and part predictions and used the same pooling
weights. Figure 6 shows the DPM detections and a visualization
of pooled DPD features on a sample test image.
As in our first approach, we resized each predicted part box
to 256 x 256 and computed DeCAF6 to replace the KDES
image features (Bo et al., 2010) used in the DPD paper.

Our results, as well as those from the literature, are
listed in Table 2. DeCAF together with a simple logistic regression
already obtains a significant performance increase
over existing approaches, indicating that such features, although
not specifically designed to model subcategory-level
differences, capture such information well. In addition,
explicitly taking more structured information such as
part locations still helps, and provides another significant
performance increase, obtaining an accuracy of 64.96%,
compared to the 50.98% accuracy reported in (Zhang et al.,
2013). It also outperforms POOF (Berg & Belhumeur,
2013), which is the best part-based approach for fine-grained
categorization published so far.

To the best of our knowledge, this is the best accuracy reported
so far in the literature.

(a) DPM detections  (b) Parts  (c) DPD

Figure 6. Pipeline of the deformable part descriptor (DPD) on a sample
test image. It uses DPM for part localization and then uses
learned pooling weights for the final pose-normalized representation.

Method                            Accuracy
DeCAF6                            58.75
DPD + DeCAF6                      64.96
DPD (Zhang et al., 2013)          50.98
POOF (Berg & Belhumeur, 2013)     56.78

Table 2. Accuracy on the Caltech-UCSD bird dataset.

We note again that in all the experiments above, no fine- 
tuning is carried out on the CNN layers since our main 
interest is to analyze how DeCAF generalizes to different 
tasks. To obtain the best possible result one may want to 
perform a full back-propagation. However, the fact that we 
see a significant performance increase without fine-tuning 
suggests that DeCAF may serve as a good off-the-shelf vi- 
sual representation without heavy computation. 
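
To make the "off-the-shelf" setting concrete, the sketch below trains a multinomial logistic regression on fixed feature vectors and never updates the feature extractor. It is an illustration only: the function names, learning rate, regularization strength, and epoch count are assumptions rather than the settings used in the experiments, and the features are taken to be precomputed rows of `X`.

```python
import numpy as np

def train_softmax(X, y, n_classes, lr=0.5, l2=1e-4, epochs=200):
    """Fit a linear softmax classifier on frozen features X (n, d)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                          # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # class probabilities
        G = (P - Y) / n                               # gradient w.r.t. logits
        W -= lr * (X.T @ G + l2 * W)                  # only the classifier is updated
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Return the most likely class index for each feature row."""
    return np.argmax(X @ W + b, axis=1)
```

Because the gradient stops at `X`, the cost per step depends only on the feature dimension and the number of classes, not on the depth of the network that produced the features.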



                     DeCAF6         DeCAF7 
LogReg               40.94 ± 0.3    40.84 ± 0.3 
SVM                  39.36 ± 0.3    40.66 ± 0.3 
Xiao et al. (2010)   38.0 


4.4. Scene recognition 

Finally, we evaluate DeCAF on the SUN-397 large-scale 
scene recognition database (Xiao et al., 2010). Unlike ob- 
ject recognition, wherein the goal is to identify and classify 
an object which is usually the primary focus of the image, 
the goal of a scene recognition task is to classify the scene 
of the entire image. In the SUN-397 database, there are 397 
semantic scene categories including abbey, diner, mosque, 
and stadium. Because DeCAF is learned on ILSVRC, an 
object recognition database, we are applying it to a task for 
which it was not designed. Hence we might expect this 
task to be very challenging for these features, unless they 
are highly generic representations of the visual world. 

Based on the success of using dropout with DeCAF6 and 
DeCAF7 for the object recognition task detailed in 
Section 4.1, we train and evaluate linear classifiers on 
these dropped-out features on the SUN-397 database. Table 3 
gives the classification accuracy results averaged across 5 
splits of 50 training images and 50 test images per class. 
Parameters are fixed for all methods, but we select the 
top-performing method by cross-validation, training on 42 
images and testing on the remaining 8 in each split. 
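
The selection protocol above can be sketched as follows, assuming labels are stored as integer arrays. The helper names are hypothetical, and the 42/8 split is applied per class here, which is an assumption based on the 50 training images per class.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """Mean of the per-class accuracies, so each class counts equally."""
    accs = [np.mean(y_pred[y_true == c] == c)
            for c in range(n_classes) if np.any(y_true == c)]
    return float(np.mean(accs))

def split_train_val(y_train, n_val=8, seed=0):
    """Hold out n_val training images per class for validation.

    Returns (train_idx, val_idx) index arrays into y_train.
    """
    rng = np.random.RandomState(seed)
    train_idx, val_idx = [], []
    for c in np.unique(y_train):
        idx = rng.permutation(np.flatnonzero(y_train == c))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

def select_method(val_scores):
    """Pick the method name with the highest validation accuracy."""
    return max(val_scores, key=val_scores.get)
```

Only the selected method is then scored on the 50 held-out test images per class, so the test accuracy plays no part in choosing between layers or classifiers.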

Our top-performing method in terms of cross-validation 
accuracy was to use DeCAF7 with the SVM classifier, 
resulting in 40.94% test performance. Comparing against the 
method of Xiao et al. (2010), the current state-of-the-art 
method, we see a performance improvement of 2.9 percentage 
points using only DeCAF. Note that, like the state-of-the-art method 
used as a baseline in Section 4.1, this method uses a large 
set of traditional vision features and combines them with a 
multi-kernel learning method. The fact that a simple linear 
classifier on top of our single image feature outperforms 
the multi-kernel learning baseline built on top of many tra- 
ditional features demonstrates the ability of DeCAF to gen- 
eralize to other tasks and its representational power as com- 
pared to traditional hand-engineered features. 

Table 3. Average accuracy per class on SUN-397 with 50 training 
samples and 50 test samples per class, across two hidden layers 
of the network and two classifiers. Our result from the training 
protocol/classifier combination with the best validation accuracy 
(Logistic Regression with DeCAF7) is shown in bold. 

5. Discussion 

In this work, we analyze the use of deep features applied in 
a semi-supervised multi-task framework. In particular, we 
demonstrate that by leveraging an auxiliary large labeled 
object database to train a deep convolutional architecture, 
we can learn features that have sufficient representational 
power and generalization ability to perform semantic visual 
discrimination tasks using simple linear classifiers, reliably 
outperforming current state-of-the-art approaches based on 
sophisticated multi-kernel learning techniques with tradi- 
tional hand-engineered features. Our visual results demon- 
strate the generality and semantic knowledge implicit in 
these features, showing that the features tend to cluster im- 
ages into interesting semantic categories on which the net- 
work was never explicitly trained. Our numerical results 
consistently and robustly demonstrate that our multi-task 
feature learning framework can substantially improve the 
performance of a wide variety of existing methods across 
a spectrum of visual recognition tasks, including domain 
adaptation, fine-grained part-based recognition, and large- 
scale scene recognition. The ability of a visual recogni- 
tion system to achieve high classification accuracy on tasks 
with sparse labeled data has proven to be an elusive goal in 
computer vision research, but our multi-task deep learning 
framework and fast open-source implementation are signif- 
icant steps in this direction. While our current experiments 
focus on contemporary recognition challenges, we expect 
our feature to be very useful in detection, retrieval, and cat- 
egory discovery settings as well. 